A Corpus-based Approach to the Interpretation of Unknown Words with an Application to German
Author
Abstract
Usually a large proportion of the distinct word forms in a corpus receive no reading from lexical and/or morphological analysis. These unknown words constitute a major problem for NLP analysis tasks such as POS tagging or syntactic parsing. We present a parameterizable (in principle language-independent) corpus-based approach to the interpretation of unknown words that needs only a tokenized corpus and can be used in both offline and online applications. In combination with a few linguistic (language-dependent) rules, unknown verbs, adjectives, nouns, multiword units, etc. are identified. Depending on the recognized word class(es), more detailed morphosyntactic and semantic information is additionally identified. This contrasts with the majority of other unknown-word guessing methods, which use only a very narrow decision window to assign an unknown word its correct reading or part-of-speech tag in a given text. We tested our approach in experiments with German data and obtained very promising results.
Similar resources
A New Method for Extracting Named Entities in Classical Arabic
In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has been considered by NLP researchers due to its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...
The Rhetorical-Aesthetic Approach to Constructing the Relation between Images and Visual Inventions with Global Politics
Images and photos play an important role in our understanding of domestic and international events. Today we are living in the age of the visualization of politics. The images are vague, rhetorical, and aesthetic components of political and social phenomena and can give them a beautiful or detestable structure. In the digital age, images in and of themselves can define our structure and vision ...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Dictionary of Abstract and Concrete Words of the Russian Language: A Methodology for Creation and Application
The paper describes the first stage of a project on creating an electronic dictionary with numerical estimates of the degree of abstractness and concreteness of Russian words. Our approach is to integrate data obtained from several different sources: text corpora, psycholinguistic experiments, published dictionaries, markers of abstractness (certain suffixes) and a translation of a similar dict...
The effect of smartphone application-based learning on intensive care nurses' knowledge about the arterial gas interpretation
Background: Arterial blood gas interpretation is necessary to train nurses in intensive care units. The need to use modern educational methods to improve the knowledge of nurses with long-term durability is considered important. Regarding the importance of using technology in nursing education and the necessity of interpretation of arterial blood gas testing, this study was conducted to invest...